Crash and hang fixes #2
Conversation
…sk_cq_delfd(): similarly to what is done in msk_cma_event_handler(), check that trans->comp_channel is still there before calling msk_cq_delfd(). I hit a crash where comp_channel had already been freed while msk_cq_thread() was still trying to access it.
msk_destroy_trans() can hang forever if rdma_disconnect() fails. If it fails for some reason, no cm disconnect event will ever be received, so trans->state never reaches MSK_CLOSED. The thread calling msk_destroy_trans() then blocks in msk_cond_wait(), waiting forever for a state change. Today msk_destroy_trans() doesn't return an error code; we only log errors. In the future it might be worth changing the interface so the caller can be notified when an error occurs during destruction.
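For illustration, here is a minimal sketch of that failure mode and one possible way out; the struct fields (cm_id, cm_lock, cond, state) and MSK_CLOSED are taken from the identifiers mentioned in this thread, everything else is assumed and is not mooshika's actual destroy path:

/* Illustration only -- NOT mooshika's real code.  Assumes struct msk_trans
 * has cm_id, cm_lock (pthread mutex), cond (pthread cond) and state fields. */
static void destroy_trans_sketch(struct msk_trans *trans)
{
    pthread_mutex_lock(&trans->cm_lock);

    if (rdma_disconnect(trans->cm_id)) {
        /* rdma_disconnect() failed: no RDMA_CM_EVENT_DISCONNECTED will be
         * delivered, so the cm event handler never moves state to
         * MSK_CLOSED.  Forcing it here (one possible fix) keeps the wait
         * loop below from blocking forever. */
        trans->state = MSK_CLOSED;
    }

    while (trans->state != MSK_CLOSED)
        pthread_cond_wait(&trans->cond, &trans->cm_lock);

    pthread_mutex_unlock(&trans->cm_lock);
}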
martinetd left a comment
Do you know what caused rdma_disconnect to fail in the first place? Some race with destroy? I'm worried that if there is such a race we could get something much worse than a hang, e.g. issue the disconnect on a reused FD instead... It would be good to understand the root cause: we should both do this check and fix the reason it happens in the first place.
(Style-wise, please make sure the first line of your commit messages is <= 72 characters; GitHub is particularly annoying with long subjects, but it is good practice anyway.)
  // closing trans, skip this, will be done on flush
- msk_cq_delfd(trans);
+ if (trans->comp_channel)
+     msk_cq_delfd(trans);
Hm, I realize the old code (the other place where we check comp_channel before calling msk_cq_delfd) has the same problem, but I'm wondering whether this is safe race-wise...
If the application calls msk_destroy_trans itself for one reason or another, I believe this could happen; I'd call msk_destroy_qp with trans->cm_lock taken and move the other msk_cq_delfd inside the locked section as well. What do you think?
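Roughly, the idea would be something like this; just a sketch, assuming cm_lock is a pthread mutex and that msk_destroy_qp()/msk_cq_delfd() take the trans as argument, not a tested patch:

/* Rough sketch of the suggested locking -- not a tested patch. */
pthread_mutex_lock(&trans->cm_lock);

if (trans->comp_channel)
    msk_cq_delfd(trans);    /* now serialized with the cq/cm threads */

msk_destroy_qp(trans);      /* destroy the qp with cm_lock held */

pthread_mutex_unlock(&trans->cm_lock);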
(In the first place, now that I'm reading this code again, we probably don't need a separate comp_channel for each child trans: if we call ibv_create_cq with the "parent's" comp_channel it should just work... We'd need to shuffle the code a bit and call ibv_get_cq_event to grab the cq and get the trans from its cq_context, though.)
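For reference, a sketch of what that could look like; the verbs calls are the real libibverbs API, but the surrounding trans/parent fields and cq_depth are assumptions, not mooshika's current code:

/* Child cqs created on the parent's comp_channel, owning trans recovered
 * through cq_context -- assumptions, not mooshika's current code. */

/* at cq creation time, on a child trans: */
trans->cq = ibv_create_cq(trans->cm_id->verbs, cq_depth,
                          trans /* cq_context */, parent->comp_channel, 0);
ibv_req_notify_cq(trans->cq, 0);

/* single event loop on the parent's channel: */
struct ibv_cq *ev_cq;
void *ev_ctx;
while (ibv_get_cq_event(parent->comp_channel, &ev_cq, &ev_ctx) == 0) {
    struct msk_trans *child = ev_ctx;   /* trans stored as cq_context above */
    ibv_ack_cq_events(ev_cq, 1);
    ibv_req_notify_cq(ev_cq, 0);
    /* ... ibv_poll_cq(ev_cq, ...) and dispatch completions for child ... */
}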
Hi, I've just realized I left this PR on its own for a long time.
I don't remember exactly how the hang occurred. I think it was related to processes that fork.
(I have some code that adds a degree of fork support to mooshika; it's something we need in our app even though fork should be avoided when using verbs.)
Concerning the crash, I don't know the internals of mooshika very well, but I tend to agree with you about calling msk_destroy_qp() and msk_cq_delfd() with the cm_lock held.
Hmm, I wonder how ib deals with forking when memory is locked and already registered... I'll have to admit I never tried.
Anyway, how do you want to proceed? Can you still reproduce it somehow, to test whether holding the cm_lock also fixes the problem?
Since I don't have the reproducer, I can't really add the lock and close this without knowing whether it actually helps; but I'd rather not merge this one if the race ends up being fixed by the lock.
About the fork() + ib, the general advice is: don't fork!
Anyway, there is the ibv_fork_init() interface that's supposed to help with fork() support. My understanding is that it keeps track of registered memory regions and marks them as "don't fork".
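For what it's worth, a minimal usage example; ibv_fork_init() is the real libibverbs call, the DONTFORK comment reflects my understanding of what it does under the hood:

/* Minimal illustration of ibv_fork_init(); it has to be called before any
 * device is opened or memory is registered.  Error handling trimmed. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    if (ibv_fork_init()) {      /* enable fork-safe memory registration */
        fprintf(stderr, "ibv_fork_init failed\n");
        return 1;
    }

    /* ... open device, register memory, create QPs as usual ... */

    if (fork() == 0) {
        /* child: registered regions were marked DONTFORK, so the parent's
         * RDMA transfers keep hitting valid physical pages */
        _exit(0);
    }
    return 0;
}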
I have a branch in my mooshika repo that adds a helper to expose ibv_fork_init() (via msk_fork_init(), clever name, isn't it?) and a new interface, msk_lib_reset(), similar to what the Mellanox version of librdmacm proposes with rdma_lib_reset(): it can be called to reset the library's global state in a child process. I will create a PR for it sometime.
About the reproducer for the hang, I think it happened while I was doing tests around fork(), and msk_destroy_trans() was called twice for the same transport (probably once in the parent and once in the child). It was in some faulty code.
This pull request fixes one crash and one hang observed while using Mooshika.
See the commit logs for more details.